15 research outputs found

    The pangenome of hexaploid bread wheat

    There is an increasing understanding that variation in gene presence–absence plays an important role in the heritability of agronomic traits; however, there have been relatively few studies on variation in gene presence–absence in crop species. Hexaploid wheat is one of the most important food crops in the world and intensive breeding has reduced the genetic diversity of elite cultivars. Major efforts have produced draft genome assemblies for the cultivar Chinese Spring, but it is unknown how well this represents the genome diversity found in current modern elite cultivars. In this study we build an improved reference for Chinese Spring and explore gene diversity across 18 wheat cultivars. We predict a pangenome size of 140 500 ± 102 genes, a core genome of 81 070 ± 1631 genes and an average of 128 656 genes in each cultivar. Functional annotation of the variable gene set suggests that it is enriched for genes that may be associated with important agronomic traits. In addition to variation in gene presence, more than 36 million intervarietal single nucleotide polymorphisms were identified across the pangenome. This study of the wheat pangenome provides insight into genome diversity in elite wheat as a basis for genomics-based improvement of this important crop. A wheat pangenome, GBrowse, is available at http://appliedbioinformatics.com.au/cgi-bin/gb2/gbrowse/WheatPan/, and data are available to download from http://wheatgenome.info/wheat_genome_databases.php

    Assembly and comparison of two closely related Brassica napus genomes

    As an increasing number of plant genome sequences become available, it is clear that gene content varies between individuals, and the challenge arises to predict the gene content of a species. However, genome comparison is often confounded by variation in assembly and annotation. Differentiating between true gene absence and variation in assembly or annotation is essential for the accurate identification of conserved and variable genes in a species. Here, we present the de novo assembly of the B. napus cultivar Tapidor and comparison with an improved assembly of the Brassica napus cultivar Darmor-bzh. Both cultivars were annotated using the same method to allow comparison of gene content. We identified genes unique to each cultivar and differentiate these from artefacts due to variation in the assembly and annotation. We demonstrate that using a common annotation pipeline can result in different gene predictions, even for closely related cultivars, and repeat regions which collapse during assembly impact whole genome comparison. After accounting for differences in assembly and annotation, we demonstrate that the genome of Darmor-bzh contains a greater number of genes than the genome of Tapidor. Our results are the first step towards comparison of the true differences between B. napus genomes and highlight the potential sources of error in future production of a B. napus pangenome

    The Assembly of Individual Chaplin Peptides from Streptomyces coelicolor into Functional Amyloid Fibrils

    The self-association of proteins into amyloid fibrils offers an alternative to the natively folded state of many polypeptides. Although commonly associated with disease, amyloid fibrils represent the natural functional state of some proteins, such as the chaplins from the soil-dwelling bacterium Streptomyces coelicolor, which coat the aerial mycelium and spores rendering them hydrophobic. We have undertaken a biophysical characterisation of the five short chaplin peptides ChpD-H to probe the mechanism by which these peptides self-assemble in solution to form fibrils. Each of the five chaplin peptides produced synthetically or isolated from the cell wall is individually surface-active and capable of forming fibrils under a range of solution conditions in vitro. These fibrils contain a highly similar cross-β core structure and a secondary structure that resembles fibrils formed in vivo on the spore and mycelium surface. They can also restore the growth of aerial hyphae to a chaplin mutant strain. We show that cysteine residues are not required for fibril formation in vitro and propose a role for the cysteine residues conserved in four of the five short chaplin peptides

    An efficient approach to BAC based assembly of complex genomes

    Background: There has been an exponential growth in the number of genome sequencing projects since the introduction of next generation DNA sequencing technologies. Genome projects have increasingly involved assembly of whole genome data which produces inferior assemblies compared to traditional Sanger sequencing of genomic fragments cloned into bacterial artificial chromosomes (BACs). While whole genome shotgun sequencing using next generation sequencing (NGS) is relatively fast and inexpensive, this method is extremely challenging for highly complex genomes, where polyploidy or high repeat content confounds accurate assembly, or where a highly accurate ‘gold’ reference is required. Several attempts have been made to improve genome sequencing approaches by incorporating NGS methods, to variable success. Results: We present the application of a novel BAC sequencing approach which combines indexed pools of BACs, Illumina paired read sequencing, a sequence assembler specifically designed for complex BAC assembly, and a custom bioinformatics pipeline. We demonstrate this method by sequencing and assembling BAC cloned fragments from bread wheat and sugarcane genomes. Conclusions: We demonstrate that our assembly approach is accurate, robust, cost effective and scalable, with applications for complete genome sequencing in large and complex genomes

    Short read alignment using SOAP2

    Next-generation sequencing (NGS) technologies have rapidly evolved in the last 5 years, leading to the generation of millions of short reads in a single run. Consequently, various sequence alignment algorithms have been developed to compare these reads to an appropriate reference in order to perform important downstream analysis. SOAP2 from the SOAP series is one of the most commonly used alignment programs to handle NGS data, and it efficiently does so using low computer memory usage and fast alignment speed. This chapter describes the protocol used to align short reads to a reference genome using SOAP2, and highlights the significance of using the in-built command-line options to tune the behavior of the algorithm according to the inputs and the desired results

    SNP discovery using a pangenome: has the single reference approach become obsolete?

    Increasing evidence suggests that a single individual is insufficient to capture the genetic diversity within a species due to gene presence/absence variation. In order to understand the extent to which genomic variation occurs in a species, the construction of its pangenome is necessary. The pangenome represents the complete set of genes of a species; it is composed of core genes, which are present in all individuals, and variable genes, which are present only in some individuals. Aside from variations at the gene level, single nucleotide polymorphisms (SNPs) are also an important form of genetic variation. The advent of next-generation sequencing (NGS) coupled with the heritability of SNPs makes them ideal markers for genetic analysis of human, animal, and microbial data. SNPs have also been extensively used in crop genetics for association mapping, quantitative trait loci (QTL) analysis, analysis of genetic diversity, and phylogenetic analysis. This review focuses on the use of pangenomes for SNP discovery. It highlights the advantages of using a pangenome rather than a single reference for this purpose. This review also demonstrates how extra information not captured in a single reference alone can be used to provide additional support for linking genotypic data to phenotypic data
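
    The core/variable distinction described in this abstract can be illustrated with a minimal sketch. This is not the method used in the review; the function name `split_core_variable`, the cultivar labels, and the gene IDs are all hypothetical, and gene presence is assumed to be already known for each individual.

```python
from typing import Dict, Set, Tuple


def split_core_variable(presence: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Split a pangenome into core and variable genes.

    `presence` maps each individual (hypothetical labels) to the set of
    gene IDs detected in it. Core genes occur in every individual;
    variable genes occur in at least one individual but not in all.
    """
    if not presence:
        return set(), set()
    gene_sets = list(presence.values())
    pangenome = set().union(*gene_sets)   # every gene seen in any individual
    core = set.intersection(*gene_sets)   # genes shared by all individuals
    variable = pangenome - core           # the remainder of the pangenome
    return core, variable


# Toy data, invented purely for illustration.
example = {
    "cultivar_A": {"g1", "g2", "g3"},
    "cultivar_B": {"g1", "g2", "g4"},
    "cultivar_C": {"g1", "g3", "g4"},
}
core, variable = split_core_variable(example)
print("core:", sorted(core))
print("variable:", sorted(variable))
```

    In practice, presence calls depend on assembly and annotation quality, so a real analysis would need to separate true gene absence from technical artefacts before applying a split like this.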

    Insights into respiratory disease through bioinformatics: bioinformatics and lung diseases

    Respiratory diseases such as asthma, chronic obstructive pulmonary disease and lung cancer represent a critical area for medical research as millions of people are affected globally. The development of new strategies for treatment and/or prevention, and the identification of biomarkers for patient stratification and early detection of disease inception are essential to reducing the impact of lung diseases. The successful translation of research into clinical practice requires a detailed understanding of the underlying biology. In this regard, the advent of next‐generation sequencing and mass spectrometry has led to the generation of an unprecedented amount of data spanning multiple layers of biological regulation (genome, epigenome, transcriptome, proteome, metabolome and microbiome). Dealing with this wealth of data requires sophisticated bioinformatics and statistical tools. Here, we review the basic concepts in bioinformatics and genomic data analysis and illustrate the application of these tools to further our understanding of lung diseases. We also highlight the potential for data integration of multi‐omic profiles and computational drug repurposing to define disease subphenotypes and match them to targeted therapies, paving the way for personalized medicine

    Inference of multi-omics networks in plant systems

    The inference of gene regulatory networks can reveal molecular connections underlying biological processes and improve our understanding of complex biological phenomena in plants. Many previous network studies have inferred networks using only one type of omics data, such as transcriptomics. However, given more recent work applying multi-omics integration in plant biology, such as combining (phospho)proteomics with transcriptomics, it may be advantageous to integrate multiple omics data types into a comprehensive network prediction. Here, we describe a state-of-the-art approach for integrating multi-omics data with gene regulatory network inference to describe signaling pathways and uncover novel regulators. We detail how to download and process transcriptomics and (phospho)proteomics data for network inference, using an example dataset from the plant hormone signaling field. We provide a step-by-step protocol for inference, visualization, and analysis of an integrative multi-omics network using currently available methods. This chapter serves as an accessible guide for novice and intermediate bioinformaticians to analyze their own datasets and reanalyze published work. This is a preprint of the following chapter: Clark et al., A Practical Guide to Inferring Multi-Omics Networks in Plant Systems, published in Plant Gene Regulatory Networks, edited by Kerstin Kaufmann & Klaas Vandepoele, 2023, Humana Press, reproduced with permission of Humana Press. The final authenticated version is available online at: https://doi.org/10.1007/978-1-0716-3354-0_1

    Delineating the Tnt1 Insertion Landscape of the Model Legume Medicago truncatula cv. R108 at the Hi-C Resolution Using a Chromosome-Length Genome Assembly

    Legumes are of great interest for sustainable agricultural production as they fix atmospheric nitrogen to improve the soil. Medicago truncatula is a well-established model legume, and extensive studies in fundamental molecular, physiological, and developmental biology have been undertaken to translate into trait improvements in economically important legume crops worldwide. However, the M. truncatula reference genome was generated in the accession Jemalong A17, which is highly recalcitrant to transformation. M. truncatula R108 is more attractive for genetic studies due to its high transformation efficiency and Tnt1-insertion population resource for functional genomics. The need to perform accurate synteny analysis and comprehensive genome-scale comparisons necessitates a chromosome-length genome assembly for M. truncatula cv. R108. Here, we performed in situ Hi-C (48×) to anchor, order, orient scaffolds, and correct misjoins of contigs in a previously published genome assembly (R108 v1.0), resulting in an improved genome assembly containing eight chromosome-length scaffolds that span 97.62% of the sequenced bases in the input assembly. The long-range physical information data generated using Hi-C allowed us to obtain a chromosome-length ordering of the genome assembly, better validate previous draft misjoins, and provide further insights into accurately predicting synteny between A17 and R108 regions corresponding to the known chromosome 4/8 translocation. Furthermore, mapping the Tnt1 insertion landscape on this reference assembly presents an important resource for M. truncatula functional genomics by supporting efficient mutant gene identification in Tnt1 insertion lines. Our data provide a much-needed foundational resource that supports functional and molecular research into the Leguminosae for sustainable agriculture and feeding the future